Skip to content

Iterator predicate pushdown

Brian Femiano edited this page Jan 20, 2014 · 25 revisions

Custom StorageHandler implementations can use the Hive API to analyze one or more predicates that appear in a query WHERE clause. The custom Serde used by the storage handler is better equipped to incorporate the predicate(s) into query restrictions for the underlying datastore, which in many cases will be more efficient than Hive.

The AccumuloStorageHandler delegates to AccumuloPredicateHandler to aid in the decomposition of WHERE clause predicates involving (<, >, <=, >=, !=, =) over any of the defined Hive columns, including the rowID. Predicate conditions involving these operators are converted to one or more IteraterSetting used to filter key/value pairs directly from the TabletServers.

The Hive API currently only allows predicate decomposition to the storage handler based WHERE clauses that consist of purely conjuctive predicates.

Consider for example the query: SELECT * FROM table1 WHERE foo = 'hi' AND bar = 'there';

The above query consists of two predicates (foo = 'hi') and ('bar' = 'there'). Because they are combined into a single conjunctive restriction, they both are available to the StorageHandler.

In the case of SELECT * FROM table1 WHERE foo = 'hi' OR bar = 'there'; The Hive analyzer notices the disjunction, and neither of the predicates are available to the storage handler.

If even a singe disjunctive condition is found by the Hive predicate analyzer, the entire clause is evaluated at the Hive filter-level, and is not available for direct storage handler decomposition. So in the case of SELECT * FROM table1 WHERE foo = 'hi' AND bar = 'there' OR 'foo2' = 'whoa';, despite having two predicates ANDed together, the presence of the final OR operator means nothing can be passed down. This is a current limitation of the Hive API.

Hive Predicate -> Accumulo Iterators

Since predicates made visibile to StorageHandler implementations are limited to conjunctive conditions, we can assume predicates available to the AccumuloStorageHandler are meant to be ANDed together.

Predicates received by the storage handler are mapped 1-to-1 with a filter IteratorSetting. The Compare API in the storage handler makes use of a single class PrimitiveCompareFilter that extends WholeRowIterator and performs the filtering for a given qualifier.

As part of the Compare API, each IteratorSetting is supplied as part of the configuration options four critical parameters necessary to carry down the Hive column filtering to the TabletServers.

  1. The column family and qualifier corresponding to the Hive column in question. (The 'age' in "age > 5")

  2. The constant value being compared against. (The '5' in "age > 5").

  3. An implementation of PrimitiveCompare. Each instance is datatype specific and takes as an argument the constant value for comparison. The different methods encapsulate how to compare incoming byte[] values with the constant, and return either true or false.

  4. An implementation of CompareOpt, which abstracts the comparison operation to hide the underlying data type. It takes an instance of PrimitiveCompare at construction time. (The '>' in "age > 5")

The iterator uses Reflection to dynamically build and instantiate the correct CompareOpt and PrimitiveCompare objects at query scan time.

Predicate pushdown is enabled by default. You can turn off iterator pushdown on a per-table basis by adding the property 'accumulo.no.iterators' = 'true' in the SERDEPROPERTIES for the table definition.

Gotchas

To use the custom iterator, make sure accumulo-hive-storage-handler-1.6.0-SNAPSHOT.jar is available on the classpath for each TabletServer. Otherwise queries involving WHERE clauses will produce errors.

An error will occur if a key/value pair is identified where the constant type being compared against does not match the Hive column type. In the future this could be handled with type hints directly in the config mapping, similar to the latest HBase StorageHandler.

Odd behavior can occur for key/value scans that involve matching identically labeled qualifiers across different rowIDs where the value types diverge. For instance (not accounting for cell vis and ts) if row1 has an integer stored at cf/qual 'a', but row2 has a double value. Handling this is an ongoing work in progress.

RowID -> Ranges

One last thing worth mentioning. For predicates involving a Hive column mapped to rowID, AccumuloPredicteHandler will attempt to do Range restrictions instead of Iterator filtering.

For example, the following SELECT

SELECT * from table1 where rowid >= "5555" will configure a single Range where the start row is greater than or equal to '5555', and the stop row is positive infinity. Multiple predicates involving the rowID will be converted to the equivalent number of Range restrictions.

Clone this wiki locally